Abstract

This paper reports on the multiple difficulties inherent in the long-term archiving of 
digital data, and in particular on the different possible causes of definitive data loss. 

It defines the basic principles which must be respected when creating long-term archives. 
Such principles concern both the archival systems and the data. 

The archival systems should have two primary qualities: independence of architecture 
with respect to technological evolution, and genericness, i.e., the capability of ensuring 
identical service for heterogeneous data. These characteristics are implicit in the 
Reference Model for Archival Services, currently being designed within an ISO-CCSDS 
framework. A system prototype has been developed at the French Space Agency (CNES) 
in conformance with these principles, and its main characteristics will be discussed in this 
paper. 

Moreover, the data archived should be capable of abstract representation regardless of the 
technology used, and should, to the extent that it is possible, be organized, structured and 
described with the help of existing standards. The immediate advantage of 
standardization is illustrated by several concrete examples. 

Both the positive facets and the limitations of this approach are analyzed. The advantages 
of developing an object-oriented data model within this context are then examined. 
1. Introduction


The observations and data gathered by scientific payloads carried on
board satellites or interplanetary probes are archived on the ground in the form of digital
data, which are generally accessible to the PI teams or to larger communities. After more
than 30 years of experience in this field, the following facts have been observed:

- the volume represented by this data is always on the increase; 

- some of the data has been lost because the physical medium became unreadable; 

- other data has been or is likely to be lost because its structure was dependent on 
operating systems which are now obsolete; 

- other data has been lost because an exhaustive and correct data description was no 
longer available; 

- it has turned out to be impossible to keep as many access software packages in operating
order as there are sets of data, essentially for reasons of cost. Consequently, some of the
data, while not actually lost, is no longer accessible;

- knowledge - or rather human expertise - concerning the oldest data is quickly 
disappearing. 

Within this context, a safeguard plan for conserving data archived on 70,000 magnetic 
tapes at the French Space Agency (CNES) has recently been implemented. This data 
represents, for the most part, a priceless scientific heritage which should remain of great 
interest for several decades to come, or even longer. The cost of producing this data 
represents, in fact, the cost of all scientific space missions since the 1960's, which is, 
needless to say, enormous. 

In practice, most of the observations made above are valid in many other fields 
(scientific, cultural, audio-visual, industrial, etc.). They boil down to the contradictions 
between the need to archive data in the long term and the speed at which the technology 
being used becomes outdated. Generally speaking, the loss of digital data is very often
'insidious', as the digital data is not physically 'visible'. As a result, its degradation
does not strike the mind as strongly as the deterioration of a book, for example, whose
characters become less and less readable with time, or that of a historical monument which
crumbles to the ground.

This analysis led us to undertake a thorough technical study of the problems posed by 
long-term archiving. We reached the conclusion that the setting up and maintenance of 
long-term archival services can only be achieved if certain stringent conditions are 
imposed on the archival systems and data. 
In order to avoid any ambiguity in the vocabulary, let us specify that by the term archival
system, we mean a hardware and software system responsible for the main archival 
functions: insertion of the data supplied by the data producers, conservation of the data 
and anything needed for interpreting it, access to the information concerning the data and 
dissemination of the data to the users. Such a system is itself a component of an archival 
service, which is the human organization which, in particular, maintains this system in 
operating order. 

Briefly, it may be said that archival systems should respect two main requirements: 

- the independence of their architecture with respect to technological evolution: any 
archival system relies on rapidly evolving technologies and must thus be able to evolve 
along with these technologies. Nevertheless, its architecture should be such that 
technological evolutions in one field (the physical media containing the data, for 
example, or else the user interface) should not have repercussions leading to an 
uncontrollable chain reaction throughout the system. The system components must thus 
not be correlated among themselves. This characteristic led us to reflect on the modelling 
of archival services, and to design a Reference model for these services [4]. 

- genericness, i.e. the capability of ensuring an identical service for heterogeneous
data. The primary aim of genericness is to reduce the volume of software to be
maintained.

At the same time, the data itself should satisfy two further requirements:

- its independence with respect to any technology: the data should be capable of an 
abstract representation which is completely independent of the technology being used, 

- the application of standards to the data in terms of structuring, organization, 
description, etc. The application of such standards is a necessary condition if the 
objective of genericness, defined at the system level, is to be attained. 

An archival system prototype was developed by CNES in 1995 to test such an approach. 

Later in this article we will analyze - through the lessons learned in our experiments - the
consequences of the requirements specified for the system level and for the data and 
metadata level. 

2. The problem on the system level 

It has been seen that any archival system is based on rapidly evolving technology. Our 
purpose should thus be to construct a modular system in which each component is 
sufficiently independent from the others to be able to evolve individually without calling 
either the system architecture or the principles of inter-component communication into 
question. 
At the level of the MMI (man-machine interface) component, the emergence of W3 (the
World Wide Web) is a typical example of rapid technological change. The use of X11-type
client-server systems has very quickly become outdated. Many access systems have thus
become obsolete. Only those systems designed with an independent MMI component were
able to adapt easily to this new technology.

These considerations first led us to look for a solution in terms of a general model for a 
long-term archival service which would be totally independent of technological advances. 
Other teams, in particular in the USA, have taken a similar approach and it soon became 
clear that we shared a common view of the problem on the first level of the model. 

This first level of the archival service model was no more than an outline of the actual
Reference Model, which is currently being developed within an ISO-CCSDS
framework [4]. We shall thus limit ourselves to a rough description of this first
model, and then go on to describe a system prototype developed by us in conformance 
with this preliminary model version, along with the lessons learned from this prototype. 

It seems useful first of all to define the limits of an archival service and to identify the 
external elements interacting with this service. The following four external elements may 
thus be distinguished: 

• the data producers, 

• the data users, 

• the system administrator, 

• the authority responsible for choices and for decisions regarding policy and financing. 

At the level of the model diagram shown below (figure 1), we have not included the latter 
since its interaction at the level of the archival system is limited. 

The service itself consists of five major functions (or sub-services) with respect to the
data (see figure 1):

• ingest, which serves as the interface between the data producers and the service. This 
function controls the conformance of data and metadata provided by the producers 
with respect to the requirements defined by the archival service (standardization, etc.) 
and performs the actual insertion of this data and metadata into the service. 

• physical data storage, involving an interface which hides the internal architecture. 
This storage may be designed to comply with the IEEE Mass Storage System 
Reference Model, 

• data management, based on the organization of metadata, 

• access to metadata which makes it possible for the service to check the user's access 
rights and for the user to be aware of the available data and to define a query. 
• dissemination, which makes it possible to retrieve the data from the storage service, to
extract the parts in which the user is interested, and to deliver these parts to the user.
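
The five functions listed above can be sketched as a set of narrow interfaces. The following Python fragment is an illustration only: every class and method name in it (StorageService, InMemoryStorage, ArchivalService, ingest, access, disseminate) is hypothetical and is taken neither from the Reference Model nor from the CNES prototype. It is meant to show how a storage technology might be replaced without touching the other functions.

```python
from dataclasses import dataclass, field
from typing import Protocol


class StorageService(Protocol):
    """Physical data storage: hides the internal architecture behind two calls."""

    def put(self, file_id: str, content: bytes) -> None: ...
    def get(self, file_id: str) -> bytes: ...


class InMemoryStorage:
    """Stand-in storage technology (hypothetical); tape or disk could replace it."""

    def __init__(self) -> None:
        self._files: dict[str, bytes] = {}

    def put(self, file_id: str, content: bytes) -> None:
        self._files[file_id] = content

    def get(self, file_id: str) -> bytes:
        return self._files[file_id]


@dataclass
class ArchivalService:
    """Ties together ingest, data management, access and dissemination."""

    storage: StorageService
    # Data management: metadata is kept separately from the data itself.
    metadata: dict[str, dict] = field(default_factory=dict)

    def ingest(self, file_id: str, content: bytes, description: dict) -> None:
        # Ingest checks conformance before inserting data and metadata.
        if "data_set" not in description:
            raise ValueError("metadata must name the data set")
        self.storage.put(file_id, content)
        self.metadata[file_id] = description

    def access(self, data_set: str) -> list[str]:
        # Access: lets the user find out which data is available.
        return sorted(f for f, m in self.metadata.items() if m["data_set"] == data_set)

    def disseminate(self, file_id: str) -> bytes:
        # Dissemination: retrieves the data from storage for delivery.
        return self.storage.get(file_id)


service = ArchivalService(storage=InMemoryStorage())
service.ingest("f1", b"\x00\x01", {"data_set": "demo"})
```

Because ArchivalService depends only on the two-method StorageService interface, a change of storage technology in this sketch would be confined to one class.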

Note: The distinction between external and internal elements must be made clear before 
the limits of a long-term data archival and access system may be defined. In our 
approach, the formatting of the data into a normalised and long-lasting format is done by 
the data producer rather than by the archival system. Similarly, any processing for the 
purpose of data analysis is the responsibility of the data user, while the service is
solely responsible for delivering the data corresponding to the user's query.



Figure 1: Preliminary view of the archival service model 


Architecture of the first prototype 

An archival system prototype was designed essentially on the basis of the modelling 
principles defined above, by making use of components already installed and used by 
CNES. Each component will be specified below, along with its description. 

The system provides access to chronologically ordered data. Within the system, a 'set
of data' is defined as a set of homogeneous data acquired in the same experiment
and having undergone the same processing. The principal components of a user query are
the set of data and one or more time intervals with which it is associated.
• The storage system used is STAF ("Service de Transfert et d'Archivage des Fichiers", or
"File Transfer and Archival Service"). This system for the long-term physical
preservation of data was set up at CNES 2 years ago. It functions in a heterogeneous 
environment, and is based on a client-server architecture. The client, responsible for 
data archiving and retrieval, functions on different host systems (UNIX, NOS-VE, 
etc.). This type of architecture makes the storage technology, and thus its evolution, 
invisible to the user (see figure 2). 
Figure 2: STAF diagram


• Metadata is managed by means of an ORACLE relational data base (cf. figure 3). The
metadata consists for the most part of the set of references to the data placed in the
storage service. The data base also manages:

- data protection,

- browse data (quick-looks),

- the resources and quotas allotted to each user.
Figure 3: The main elements managed by the data base

• The data ingest service inserts only the metadata into the system, at the level of the 
data manager. This is performed by software based on the Oracle SQL*loader tools. 
The insertion of the data itself at the level of the storage function is independently 
performed by the data producers. 

• The access service is based on a WWW server. The latter is linked to the data base by
means of a CGI program written in Pro*C. The WWW server is responsible for checking the
user's identification. This service is a critical point in the system, as it provides system
access from anywhere on the Internet. It has thus undergone a security study, to prevent any
ill-intentioned intrusion.

• The data dissemination service is the set of generic programs enabling both the
retrieval of data from the archive and its delivery, either by FTP to the
user's station or, at the level of the server station, into a W3 directory owned by that
user. These programs, written in C, depend on a standardized date format, and use
EAST descriptions [2] to extract the parts in which the user is interested from the 
archived files (see § 3.2). 
Figure 4: The prototype architecture
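
The role of the metadata data base in resolving a user query (a set of data plus a time interval) can be illustrated with a minimal sketch. It uses Python's built-in sqlite3 module as a stand-in for the ORACLE data base. The table names, column names and sample rows are all hypothetical, and do not reproduce the schema of the prototype.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE data_set (
        name TEXT PRIMARY KEY,
        protected INTEGER NOT NULL DEFAULT 0   -- data protection flag
    );
    CREATE TABLE file_ref (
        storage_id TEXT PRIMARY KEY,           -- reference into the storage service
        data_set TEXT NOT NULL REFERENCES data_set(name),
        start_time TEXT NOT NULL,              -- time of the first record in the file
        end_time TEXT NOT NULL                 -- time of the last record in the file
    );
    """
)
conn.execute("INSERT INTO data_set VALUES ('demo_set', 0)")
conn.executemany(
    "INSERT INTO file_ref VALUES (?, ?, ?, ?)",
    [
        ("file_a", "demo_set", "1995-01-01T00:00:00Z", "1995-01-01T23:59:59Z"),
        ("file_b", "demo_set", "1995-01-02T00:00:00Z", "1995-01-02T23:59:59Z"),
        ("file_c", "demo_set", "1995-01-03T00:00:00Z", "1995-01-03T23:59:59Z"),
    ],
)


def files_for_query(data_set: str, start: str, end: str) -> list[str]:
    """Return storage references whose time span overlaps [start, end]."""
    rows = conn.execute(
        "SELECT storage_id FROM file_ref "
        "WHERE data_set = ? AND start_time <= ? AND end_time >= ? "
        "ORDER BY start_time",
        (data_set, end, start),
    )
    return [row[0] for row in rows]
```

The comparison of time stamps as plain text works here only because all of them share one fixed-width layout, which is itself an argument for a single standardized time format.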


3. The problem as far as the data and metadata are concerned 

3.1 Fundamental rules for data independence with respect to the technology 

The first fundamental rule, which is clearly necessary, is the independence of the data 
with respect to the machines, the operating systems and its environment in general. Any 
digital data item can be abstractly represented by a sequence of bits divided into fields. 
Each field may be subdivided into sub-fields, and the latter may be further subdivided 
until indivisible units of information are reached. The first rule requires in particular: 

- that the bit sequence contain no information inserted by the operating system which 
created it: only the relevant bits defined by the user shall be included. Consequently, the 
use of any file structure into which the operating system has inserted information to help 
in administration or control is strictly forbidden. 

- that the coding of elementary fields be performed in conformance with recognized 
standards (ISO/IEC 646 for characters, IEEE for floating-point numbers, standard 
representations of images and graphs, etc.) and that any representation specific to a given 
manufacturer be prohibited. 
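
As a hedged illustration of this first rule, the sketch below writes a record as an explicit byte sequence with Python's struct module: ASCII characters (taken here as the ISO/IEC 646 coding) and one IEEE 754 double in a fixed byte order, with no padding and no system-inserted header. The record layout and the function names are assumptions made for the example.

```python
import struct

# Hypothetical record: 8 ASCII characters (instrument name) followed by
# one IEEE 754 double, big-endian, with no padding and no OS-specific header.
RECORD_FORMAT = ">8sd"


def encode_record(name: str, value: float) -> bytes:
    # 7-bit ASCII encoding; raises UnicodeEncodeError on any other character.
    raw_name = name.encode("ascii").ljust(8, b" ")
    return struct.pack(RECORD_FORMAT, raw_name, value)


def decode_record(raw: bytes) -> tuple[str, float]:
    raw_name, value = struct.unpack(RECORD_FORMAT, raw)
    return raw_name.decode("ascii").rstrip(" "), value


record = encode_record("MAG", 1.5)
```

Every byte of the resulting 16-byte record is defined by the format string alone, so the same bytes are obtained whatever machine or operating system runs the code.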
The second rule to be applied concerns the necessity of having an exact and exhaustive
description of the bit sequence available: the position of each elementary field, a 
description of this field's coding, the nature and meaning of the information contained in 
this field. This second rule prohibits, for example, reading and writing data through the 
blind use of software tools which do not provide thorough knowledge of the bit sequence 
in its abstract representation. 
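
The second rule can be pictured as a description that is complete enough to drive the reading of the data by itself. The following sketch is an assumption-laden illustration and is not the EAST language: the FieldDescription class, the field names and the record layout are hypothetical.

```python
import struct
from dataclasses import dataclass


@dataclass(frozen=True)
class FieldDescription:
    name: str      # identifier of the field
    offset: int    # position of the field, in bytes from the record start
    coding: str    # struct format code describing the coding of the field
    meaning: str   # nature and meaning of the information


# Hypothetical description of a 12-byte record.
DESCRIPTION = [
    FieldDescription("counter", 0, ">I", "sample counter, unsigned 32-bit integer"),
    FieldDescription("field_nt", 4, ">d", "magnetic field magnitude in nT, IEEE 754 double"),
]


def interpret(raw: bytes, description: list[FieldDescription]) -> dict[str, object]:
    """Decode a record using only what the description states."""
    return {
        f.name: struct.unpack_from(f.coding, raw, f.offset)[0]
        for f in description
    }


raw_record = struct.pack(">Id", 7, 42.0)
values = interpret(raw_record, DESCRIPTION)
```

A reader that relies on nothing but the description fails visibly when the description is incomplete, which is the property the rule is after.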

While these requirements are elementary, they are far from having always been 
respected. They apply both to data and to metadata, and are necessary if the aim of data 
independence is to be attained. They apply to the abstract representation of the data rather 
than to its actual physical storage, which depends on the technology available at a given 
moment. They may naturally meet with difficulties related to a lack of standards in a 
given field. 

3.2 The application of advanced data standards and the key to system genericness 

The infinite diversity of information representations which may be imagined is such that 
it is certainly useless to try to provide advanced and generic data access facilities without 
first investing in the standardization of these representations. Let us consider two simple 
examples which we have encountered, concerning the standardization of times and dates 
on the one hand, and the standardization of descriptions on the other hand. 

Standardization of times and dates: in certain scientific disciplines such as Space
Physics, data is often organized chronologically. We discovered in the older data that the
variety of time and date representation formats was almost as large as the number of
existing data sets. Given such a situation, when a user is interested in data for a given
time frame, two options exist:

- either to supply the archived files containing this time frame, which is hardly 
satisfactory, 

- or to develop and implement at the archival system level a specific extraction program 
for each set of data, an unrealistic approach with respect to the long-term perspective. 

It quickly became obvious that a standardization of times and dates would resolve this 
problem in a satisfactory manner. We therefore selected the standardization proposed by 
CCSDS [1], which is more complete than the ISO standard in this field. For our first 
prototype, we were able to develop a general program for the extraction of data 
corresponding to one or more time frames defined by the user from one or more files. 
This program makes it possible, as shown in figure 5 below, to extract only that data 
which corresponds strictly to the time frame requested by the user. The extraction 
function is entirely independent of the archive structure: 
Figure 5: Extraction of chronologically ordered data
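
The principle of such a general extraction program can be sketched as follows. This is not the program written for the prototype (which is in C). The time stamp layout used here is an assumed calendar-style ASCII form chosen for the example, and the exact layouts should be taken from the CCSDS recommendation [1]. The function names and sample records are hypothetical.

```python
from datetime import datetime, timezone

TIME_FORMAT = "%Y-%m-%dT%H:%M:%SZ"  # assumed ASCII layout for this sketch


def parse_time(text: str) -> datetime:
    return datetime.strptime(text, TIME_FORMAT).replace(tzinfo=timezone.utc)


def extract(records, intervals):
    """Yield the records whose time stamp falls in any requested interval.

    records: iterable of (time_text, payload) pairs
    intervals: list of (start_text, end_text) pairs, both bounds inclusive
    """
    bounds = [(parse_time(s), parse_time(e)) for s, e in intervals]
    for time_text, payload in records:
        t = parse_time(time_text)
        if any(start <= t <= end for start, end in bounds):
            yield time_text, payload


# Two archived files; the requested interval straddles the file boundary.
file_1 = [("1995-03-01T00:00:00Z", "a"), ("1995-03-01T12:00:00Z", "b")]
file_2 = [("1995-03-02T00:00:00Z", "c"), ("1995-03-02T12:00:00Z", "d")]
selected = list(
    extract(file_1 + file_2, [("1995-03-01T06:00:00Z", "1995-03-02T06:00:00Z")])
)
```

The function never inspects which file a record came from, so the same code serves any data set that carries the agreed time stamp.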


Standardized data descriptions: we discovered that the data was often described either
incompletely (certain fields are left out of the description) or incorrectly (due to changes 
in the data creation program which were not carried over to the description documents). 
Moreover, the form of the description generally differs from one project to the next. The 
beginnings of a solution to this difficult and crucial problem in long-term archiving were 
found through the standardization of data description languages. 

Our experiment, in our first prototype, was based on the EAST language (Enhanced Ada
SubseT) [2]. This is a formal language around which certain general tools have been or
are currently being developed. Worth mentioning in this field, in particular, are the Data
Description Record Generator, the Data Generator, the Data Interpreter and the Data 
Formatter [3]. 

The interactive creation of data descriptions in EAST is performed with the help of a 
graphical interface, and the use of these descriptions for reading and writing data makes it 
possible to guarantee, during construction, the consistency between the data and its 
description. 

Within this framework, we experimented with the use of a generic tool enabling the user 
to select a subset of information fields present in the archive. 
The use of this tool involves two stages:

- a first stage in which the user selects the fields in which he is interested. A hierarchical 
tree representation of the different data fields is constructed on the basis of the EAST 
description. Using this representation, the user identifies and marks the fields in which he 
is interested (WWW interface). 

- a second stage in which data is extracted from an archive and then 'filtered' so as to
preserve only those fields requested by the user (cf. figure 6).




Figure 6: Field extraction
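
The two stages described above can be sketched in a few lines. The nested dictionary below merely stands in for a hierarchical description. It is not EAST syntax, and the field names and function names are hypothetical.

```python
# Hypothetical hierarchical description: nested dicts are structured
# fields, None marks an elementary field.
DESCRIPTION_TREE = {
    "header": {"time": None, "instrument": None},
    "measurement": {"bx": None, "by": None, "bz": None},
}


def field_paths(tree, prefix=""):
    """Stage 1: list every elementary field as a dotted path for display."""
    paths = []
    for name, child in tree.items():
        path = f"{prefix}{name}"
        if child is None:
            paths.append(path)
        else:
            paths.extend(field_paths(child, path + "."))
    return paths


def filter_record(record, selected):
    """Stage 2: keep only the selected elementary fields of a decoded record."""
    result = {}
    for path in selected:
        value = record
        for key in path.split("."):
            value = value[key]
        result[path] = value
    return result


record = {
    "header": {"time": "1995-03-01T12:00:00Z", "instrument": "MAG"},
    "measurement": {"bx": 1.0, "by": 2.0, "bz": 3.0},
}
chosen = ["header.time", "measurement.bz"]
filtered = filter_record(record, chosen)
```

Both functions are driven entirely by the description and the user's selection, so nothing in them is specific to one data set.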

The above are two meaningful examples with which we have experimented. They
illustrate the correlation between the level of data and metadata standardization which we
were able to attain and our capacity to preserve the data and keep it accessible in the long
term.

4. Lessons learned

Positive points to be retained from the experiment with this first prototype 

The aim of genericness within the field of chronologically organized data was achieved. 
Once the data producers began to respect the requirements set forth with respect to time 
and date standardization, we noted that it became very easy to access new data and hence 
that our approach had not simply been idealistic. At the present time, the system offers 
access to 32 different sets of data acquired during 5 space missions (INTERBALL, 
Sweden VIKING, ISEE1, VOYAGER, GEOS). No specific tool had to be developed to
make this data accessible through our prototype. 

The service provided to the user is much better than the simple extraction of data from an
archive, and since the volume of data transmitted to the user corresponds only to the data
in which he is actually interested, far less network capacity is consumed by
these transmissions.

Scientists do not naturally apply standards simply on principle. On the other hand, in a 
case such as that of times and dates, when the experiment teams applied the standard at 
the moment of data production, they perceived it as a way of immediately obtaining a 
better access service. 

The limitations 

The use of a relational model is the main limitation of our system. Adding a new 
selection criterion other than those defined at the moment of installation involves serious 
modifications both in the relational model of the metadata management function and in 
presentation at the level of the access function. This limitation curbs the open-endedness 
of the system. 

5. Conclusion: towards an object-oriented data model 

Without going into the details of work currently being performed on the object-oriented 
modelling of a long-term archival service, we shall explain a few important concepts: 

• Data with shared characteristics can be collected into sets known as 'collections'. 
These collections can then be grouped together into 'collection groups', the collection 
groups themselves can be grouped together as well, and so on. Moreover, a collection 
may belong to several distinct collection groups. This representation led us to the 
construction of a directed graph. 

• In order to define a query, a user will navigate through a directed graph which groups 
the data together according to scientific field, selection criteria or any other shared 
characteristic. To reach the data itself, each group offers selection criteria by means of 
which the user may select a given daughter group. This approach will provide the user 
with an infinite number of possibilities when searching for interesting data: if he 
wishes to create a new search route, he need only install the new groups needed to 
propose it. 

• The lowest level group is a data collection grouping together a set of elementary 
logical data objects, while these logical objects are themselves made up of storage 
objects, i.e. in the general case, files. This approach, which may seem complicated at 
first glance, provides the system with considerable flexibility. A collection could 
correspond to a virtual data set, created at the same moment as it is being accessed. 
For example, a set of image data can, at the level of storage, be made to correspond
either to files containing several images or to files containing only a part of an image.
With this approach, the difference becomes invisible at the system level, which enables
access to a collection of images reconstituted during this access, either by cutting up a
file or by concatenating several files.

• Selection criteria for specific cases are available at the level of the groups or 
collections, and the same approach could also permit transformation or delivery 
criteria which could be applied to the data collections. 
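
The concepts listed above can be made concrete with a small sketch of the directed graph. It is an illustration under our own assumptions and not the data model under development: the Node class, its attributes, the criteria and the sample names are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A collection group, or a collection when it has no children."""

    name: str
    criteria: dict[str, str] = field(default_factory=dict)
    children: list["Node"] = field(default_factory=list)
    objects: list[str] = field(default_factory=list)  # logical data objects

    def select(self, **wanted: str) -> list["Node"]:
        """Return the daughter groups matching every requested criterion."""
        return [
            child
            for child in self.children
            if all(child.criteria.get(k) == v for k, v in wanted.items())
        ]


# One collection shared by two groups: the graph is directed but not a tree.
images = Node("images_1995", {"year": "1995", "type": "image"}, objects=["img_001"])
by_year = Node("by_year", {"route": "year"}, children=[images])
by_type = Node("by_type", {"route": "type"}, children=[images])
root = Node("root", children=[by_year, by_type])

# Two different search routes lead to the same collection.
via_year = root.select(route="year")[0].select(year="1995")[0]
via_type = root.select(route="type")[0].select(type="image")[0]
```

In this sketch, adding a search route amounts to adding a group node that points at existing collections, without modifying the collections themselves.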